PKU-ICST at TRECVID 2009: High Level Feature Extraction and Search
Abstract
We participate in two tasks of TRECVID 2009: high-level feature extraction (HLFE) and search. This paper presents our approaches and results for both tasks. In the HLFE task, we focus on effective feature representation, learning from imbalanced data, and fusion across different data sets. For feature representation, we adopt five basic visual features and six keypoint-based BoW features, and combine them to represent each keyframe image. For imbalanced learning, we propose two methods: OnUm and concept category. For fusion across data sets, we use three different training sets: (1) the TRECVID 2009 training data set (Tv09), (2) the TRECVID 2005 training data set (Tv05), and (3) Flickr images. In the search task, we participate in two types of search: automatic search and manual search. We explore a multimodal feature representation that includes visual features, concept-based features, audio features, and face features. Based on these features, two retrieval methods are jointly adopted for the search task: a pair-wise similarity measure and learning-based ranking. We achieve good results in both tasks. In the HLFE task, the official evaluation shows that our team ranks 2nd in type A and 1st in types C, a, and c. In the search task, the official evaluation shows that our team ranks 2nd in automatic search and 1st in manual search.

1 High Level Feature Extraction

In the HLFE task of TRECVID 2009, we participate in all four types of evaluation. The National Institute of Standards and Technology (NIST) defines four types of runs according to the training data used: A, a, C, and c. Type-a runs use only non-SV TRECVID data, while type-A runs can use all TRECVID data (SV and non-SV). SV data refers to the TRECVID 2007, 2008, and 2009 data sets, which were donated by the Sound and Vision organization (SV). TRECVID data from other years are called non-SV data.
Neither type-a nor type-A runs may use non-TRECVID training data (e.g., web images). Type-c runs can use any training data except SV data, while type-C runs have no restriction on training data. Of our six submitted runs, Run 1 is type C, Runs 2, 3, and 4 are type A, Run 5 is type c, and Run 6 is type a. They are described as follows:

C-PKU-ICST-HLFE-1 (A-PKU-ICST-HLFE-3 + c-PKU-ICST-HLFE-5): weighted fusion of A-PKU-ICST-HLFE-3 and c-PKU-ICST-HLFE-5.

A-PKU-ICST-HLFE-2 (A-PKU-ICST-HLFE-3 + a-PKU-ICST-HLFE-6): weighted fusion of A-PKU-ICST-HLFE-3 and a-PKU-ICST-HLFE-6.

A-PKU-ICST-HLFE-3 (visual feature + O3U3 + Tv09 + audio feature + concept category): early fusion of five basic visual features and six keypoint-based BoW features, with audio features used for a few related concepts; trained with the O3U3 classifier on Tv09 data, using concept category.

A-PKU-ICST-HLFE-4 (visual feature + O3U3 + Tv09): early fusion of five basic visual features and six keypoint-based BoW features, trained with the O3U3 classifier on Tv09 data.

c-PKU-ICST-HLFE-5 (visual feature + O2U2 + Flickr + a-PKU-ICST-HLFE-6): early fusion of five basic visual features and six keypoint-based BoW features, trained with the O2U2 classifier on Flickr data, using concept category, and fused with a-PKU-ICST-HLFE-6.

a-PKU-ICST-HLFE-6 (visual feature + O2U2 + Tv05): early fusion of five basic visual features and six keypoint-based BoW features, trained with the O2U2 classifier on a subset of Tv05 data, using concept category.

The evaluation results of our six runs are shown in Table 1. The official evaluation shows that, among type-A runs, our team ranks 2nd of the 41 teams that submitted type-A runs (our best run ranks 4th among all 202 type-A runs from the 41 teams, and the top three runs belong to the same team). In types C, a, and c, our runs rank 1st.

Table 1: Results of our six submitted runs on the HLFE task of TRECVID 2009.
ID                  MAP     Brief description
C-PKU-ICST-HLFE-1   0.205   A-PKU-ICST-HLFE-3 + c-PKU-ICST-HLFE-5
A-PKU-ICST-HLFE-2   0.203   A-PKU-ICST-HLFE-3 + a-PKU-ICST-HLFE-6
A-PKU-ICST-HLFE-3   0.199   Visual feature + O3U3 + Tv09 + audio feature + concept category
A-PKU-ICST-HLFE-4   0.198   Visual feature + O3U3 + Tv09
c-PKU-ICST-HLFE-5   0.120   Visual feature + O2U2 + Flickr + concept category + a-PKU-ICST-HLFE-6
a-PKU-ICST-HLFE-6   0.092   Visual feature + O2U2 + Tv05 + concept category

The framework of our HLFE system is shown in Figure 1. Besides the training data set from TRECVID 2009 (Tv09), we also use two other data sets: the TRECVID 2005 training data set (Tv05) and web images downloaded from the Flickr website (Flickr). For each of the three training sets, (1) the same visual features are extracted, (2) the same OnUm algorithm is adopted (with different parameters) to handle the data imbalance problem, and (3) the same concept category method is used to exploit inter-concept correlation. Audio features are used only with the TRECVID 2009 training data set. For the test set, five keyframes are uniformly extracted from each subshot and the same visual features are extracted.

As shown in Figure 1, the six submitted runs are the separate or combined results of the three training data sets. Our sixth run, a-PKU-ICST-HLFE-6 (Run6), uses only Tv05 data, while Run5 combines Tv05 data and Flickr data and gains a large performance improvement over Run6 (MAP 0.120 versus 0.092). Run4, our baseline type-A run, does not use the concept category method yet already performs much better than Run6 and Run5, because both the training data of Run4 and the test data come from the TRECVID 2009 data sets and have similar video content. In Run3, adding audio features and inter-concept correlation yields only a slight improvement over Run4 (MAP 0.199 versus 0.198). Run2 combines the results of Run3 and Run6, while Run1 combines the results of Run3 and Run5; both improve on the individual results (MAP 0.203 and 0.205, versus 0.199 for Run3 alone). This shows that the three training data sets are complementary for the HLFE task.
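The weighted fusion used for Run1 and Run2 can be illustrated with a minimal sketch. The paper does not give the fusion weights or say how scores are normalized, so the min-max normalization, the default `weight_a=0.7`, and the function names below are all assumptions made for illustration only.

```python
def min_max(scores):
    """Rescale a {shot_id: score} dict to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def weighted_fusion(run_a, run_b, weight_a=0.7):
    """Linearly combine two runs' scores and return shots ranked best-first.

    weight_a is a hypothetical value; shots missing from one run score 0 there.
    """
    a, b = min_max(run_a), min_max(run_b)
    fused = {
        shot: weight_a * a.get(shot, 0.0) + (1.0 - weight_a) * b.get(shot, 0.0)
        for shot in set(a) | set(b)
    }
    # Sort by descending fused score, breaking ties by shot id.
    return sorted(fused, key=lambda shot: (-fused[shot], shot))


# Toy detector scores for one concept from two hypothetical runs.
run3_scores = {"shot1": 0.9, "shot2": 0.2, "shot3": 0.5}
run5_scores = {"shot1": 0.1, "shot2": 0.8, "shot3": 0.6}
ranking = weighted_fusion(run3_scores, run5_scores)
```

In practice the weight would be tuned per concept or globally on held-out data; a larger `weight_a` favors the run trained on data closer to the test set.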
Figure 1: Framework of our HLFE approach for the six submitted runs.

1.1 Feature Representation

We use three kinds of features for the HLFE task: basic visual features, keypoint-based BoW features, and audio features. The basic visual features and keypoint-based BoW features are used for all 20 concepts, while the audio features are used only for three related concepts: Person-playing-a-musical-instrument, Female-human-face-closeup, and Singing.

1.1.1 Basic visual features

We extract five basic visual features from each keyframe image: CMG (Color Moment Grid), LBP (Local Binary Pattern), Gabor (Gabor wavelet texture), EHL (Edge Histogram Layout), and EOH (Edge Orientation Histogram). The details of these visual features are as follows:

(1) CMG (225-d): the image is divided into sub-images by a 5x5 grid in the CIE-Lab color space. The color moments of the 1st, 2nd, and 3rd orders are extracted from each sub-image in each channel (25 sub-images x 3 channels x 3 moments).

(2) LBP (531-d): it depicts the relationship between the center pixel and P equally spaced pixels on a circle of radius R in a gray-scale image. We first divide the gray-scale image into sub-images by a 3x3 grid, and then choose a neighborhood of 8 (P = 8) equally spaced pixels on a circle of radius 1 (R = 1) that form a circularly symmetric neighbor set with "uniform" patterns.

(3) Gabor wavelet texture (240-d): we first partition the gray-scale image into five regions, and then apply 24 Gabor filters in each region. The mean and standard deviation of each filter response are computed in each of the five regions (5 regions x 24 filters x 2 statistics).

(4) EHL (320-d): we first partition the gray-scale image into five regions, and then extract an edge histogram with 8 direction bins and 8 magnitude bins in each region.

(5) EOH (657-d): we first divide the gray-scale image into sub-images by a 3x3 grid, and then extract edge points in each grid cell. A 73-bin histogram is computed for each cell: the first 72 bins represent edge pixels by their directions, and the last bin is the number of non-edge pixels.

1.1.2 Keypoint-based BoW features

As in last year, we continue to explore keypoint-based BoW (Bag-of-Words) features to represent each keyframe image. In our method, the extraction of keypoint-based BoW features includes three steps:

(1) Detect keypoints in the images, and use the SIFT descriptor [1] to extract 128-d feature vectors for the keypoints;

(2) Use the k-means algorithm to cluster the keypoints into 500 clusters, and form a visual vocabulary from the cluster centroids;

(3) Adopt the soft-weighting method [5] to assign each keypoint to multiple nearest visual words (centroids), where the word weights are determined by keypoint-to-word similarity. The normalized histogram of visual words forms a BoW feature vector.

To improve the performance of the BoW feature, in step (1) we adopt six complementary detectors to detect keypoints in the images: Difference of Gaussian (DoG) [1], Laplacian of Gaussian (LoG) [1], Harris Laplace [2], Dense sampling [9], Hessian Affine [3], and MSER [4]. For each detector, a 500-d feature vector is generated separately, and the six feature vectors are concatenated to form a 3000-d BoW feature, as shown in Figure 2. We then combine it with the basic visual features (1973-d in total) in an "early fusion" manner, resulting in a 4973-d visual feature.
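The soft-weighting step and the early fusion by concatenation can be sketched as follows. This is only an illustration under stated assumptions: the paper defers the details to [5], so the number of nearest words (`top_n=4`), the similarity `1 / (1 + distance)`, and the halving of the vote at each rank are assumed choices, and the function names are hypothetical. Only the feature dimensions are taken from the text.

```python
import math

# Dimensions stated in the text.
BASIC_DIMS = {"CMG": 225, "LBP": 531, "Gabor": 240, "EHL": 320, "EOH": 657}
NUM_DETECTORS = 6
VOCAB_SIZE = 500


def soft_bow(descriptors, vocabulary, top_n=4):
    """Soft-weighted BoW histogram: each keypoint votes for its top_n nearest words."""
    hist = [0.0] * len(vocabulary)
    for desc in descriptors:
        # Euclidean distance from this descriptor to every visual word.
        dists = sorted((math.dist(desc, word), k) for k, word in enumerate(vocabulary))
        for rank, (dist, k) in enumerate(dists[:top_n]):
            similarity = 1.0 / (1.0 + dist)      # assumed similarity function
            hist[k] += similarity / (2 ** rank)  # assumed discount per rank
    total = sum(hist)
    # Normalize so the histogram sums to 1 (left as zeros if there are no keypoints).
    return [h / total for h in hist] if total > 0 else hist


def early_fusion(*feature_vectors):
    """Early fusion: concatenate the per-feature vectors into one vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


# The fused dimensionality matches the 4973-d feature described in the text.
basic_dim = sum(BASIC_DIMS.values())   # 1973
bow_dim = NUM_DETECTORS * VOCAB_SIZE   # 3000
fused_dim = basic_dim + bow_dim        # 4973
```

With `top_n=1` the function reduces to a hard-assignment histogram, which is a convenient sanity check when comparing against soft weighting.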